This session will continue to explore the uses of ggplot2 and other packages that can improve your data visualization and statistical analyses.

Let’s load the required packages

packages_to_install <- c("tidyverse","gapminder","ggpubr","ggridges","ggreveal")

for (package in packages_to_install) {
  if (!(package %in% rownames(installed.packages()))) {
    install.packages(package, dependencies = TRUE)
    print(paste("Installed package:", package))
  } else {
    print(paste(package, "is already installed"))
  }
}
[1] "tidyverse is already installed"
[1] "gapminder is already installed"
[1] "ggpubr is already installed"
[1] "ggridges is already installed"
[1] "ggreveal is already installed"
library(tidyverse)
library(gapminder)
library(ggpubr)
library(ggridges)
library(ggreveal)

Now that we understand the building blocks that go into creating a plot with ggplot we can make our lives easier by setting a universal theme at the top of our script that will help keep our figures looking consistent.

theme_set(theme_bw()) # sets black and white theme for all plots generated in the document

theme_replace( # sets individual plot elements without overwriting theme
  axis.title.x = element_text(size = 14),
  axis.title.y = element_text(size = 14,
                              angle = 90,),
  axis.text.x = element_text(size = 12),
  axis.text.y = element_text(size = 12),
  legend.title = element_text(size = 14),
  legend.text = element_text(size = 12)
)

You can still manually set all of these plot elements in the plot code itself, but this means that every plot we make shouldn’t require quite so much coding to create a consistent visual style.

Create histograms, density plots, and violin plots

Let’s look at the basic scatter plot we created before using the Iris dataset.

ggplot(data = iris,
       aes(x = Petal.Length, 
           y = Petal.Width)) +
  geom_point(aes(color = Species))

Histograms offer several benefits in data analysis and visualization. They’re an important part of exploratory analysis of your data.

Let’s plot a basic histogram of Petal.Length, colored by Species

ggplot(data = iris,
       aes(x = Petal.Length)) +
  geom_histogram(aes(fill = Species))

We’ll add black outlines to the bars to make them easier to see

ggplot(data = iris,
       aes(x = Petal.Length)) +
  geom_histogram(
    aes(fill = Species),
    color = "black" # adds black outline to all bars
  ) -> iris_histo

Split the histogram into facets per species.

iris_histo +
  facet_grid(cols = vars(Species)) # with facet_grid we can choose whether to wrap in rows or columns


iris_histo +
  facet_grid(rows = vars(Species)) # this wraps the data in rows

Checking the normality of data

We can check normality of data using the Shapiro-Wilk Test. The shapiro.test function takes a single numeric vector as its argument and returns the result of Shapiro-Wilk test. The test evaluates the null hypothesis that the population from which the data are drawn is normally distributed. A P-value less than the significance level (usually 0.05) suggests that the data are not normally distributed.

lapply(iris[,1:4], # use lapply to apply the shapiro.test function to multiple columns
       function(x) shapiro.test(x)) -> normality_tests

# Print the results of normality tests. The results of the normality tests are stored in the normalityTests list, with variable names assigned using names().
names(normality_tests) <- colnames(iris[, 1:4])
print(normality_tests)
$Sepal.Length

    Shapiro-Wilk normality test

data:  x
W = 0.97609, p-value = 0.01018


$Sepal.Width

    Shapiro-Wilk normality test

data:  x
W = 0.98492, p-value = 0.1012


$Petal.Length

    Shapiro-Wilk normality test

data:  x
W = 0.87627, p-value = 7.412e-10


$Petal.Width

    Shapiro-Wilk normality test

data:  x
W = 0.90183, p-value = 1.68e-08

According to these test results, our only normally distributed data are the Sepal.Width measurements. Let’s check this by plotting histograms for each variable using hist(). The par(mfrow = c(2,2)) command sets up a 2 x 2 grid layout for the histograms, allowing them to be displayed in separate panels.

par(mfrow = c(2, 2))

hist(iris$Sepal.Length, main = "Sepal Length", xlab = "Length")

hist(iris$Sepal.Width, main = "Sepal Width", xlab = "Width")

hist(iris$Petal.Length, main = "Petal Length", xlab = "Length")

hist(iris$Petal.Width, main = "Petal Width", xlab = "Width")

A slightly more attractive (but possibly less informative) way to visualize this data is using density ridges from ggridges.

To use this, we’ll need to adjust the layout of the iris dataframe to a long format.

iris %>%
  pivot_longer(cols = 1:4, # select measurement columns
               names_to = "measurement",# set column name for variables
               values_to = "value") %>% # set column name for measurements
  ggplot(aes(x = value,
             y = measurement,
             fill = Species)) +
  geom_density_ridges(
    alpha = 0.5, # set opacity of ridges
    scale = 0.98 # set height scaling of ridges
  ) +
  scale_x_continuous(name = "Measurement (cm)") +
  scale_y_discrete(name = NULL,
                   labels = c("Petal.Length" = "Petal Length",
                                "Petal.Width" = "Petal Width",
                                "Sepal.Length" = "Sepal Length",
                                "Sepal.Width" = "Sepal Width"))

Other density plots

Another way to visualize your data is using density plots, also known as kernel density plots. There are several benefits to visualizing your data through density plots:

ggplot(data = gapminder, 
       aes(x = gdpPercap, 
           y = lifeExp)) +
  geom_bin2d(
    bins = c(50,50) # vectors for number of horizontal and vertical bins
  ) +
  scale_fill_viridis_c(
    name = "Number of countries"
  ) + # uses a viridis scale which is easier to visualize
  labs(x = "GDP per capita (USD)", # set x axis title
       y = "Life expectancy (years)") + # set y axis title
  theme(
    legend.position = "top", # set legend position
    legend.key.width = unit(2, "cm"), # set legend key width
    legend.title.position = "top" # move legend title 
  )

Adding statistics to your plots

ggplot has arguments for completing simple statistical tests through the extension ggpubr.

ggpubr offers additional functions and utilities for data visualization and statistical analysis, specifically designed to work seamlessly with ggplot2. Let’s add statistics to some box plots. Here we will compare mean life expectancy between continents using the gapminder dataset.

Reference: http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/#google_vignette

First we need to check the normality of our data

shapiro.test(gapminder$lifeExp)

    Shapiro-Wilk normality test

data:  gapminder$lifeExp
W = 0.95248, p-value < 2.2e-16

Our P value is much lower than the 0.05 cutoff value which means our data are not normally distributed, so we need to take that into account when choosing a statistical test.

What test would you use to compare means across continents if the data were normally distributed?

ggplot(data = gapminder,
       aes(x = continent,
           y = lifeExp,
           color = continent)) +
  geom_boxplot(outliers = FALSE) +
  geom_jitter(width = 0.2,
              alpha = 0.3) +
  stat_compare_means(
    method = "kruskal.test",
    label.y = 90 # sets position of P value label
  ) +
  guides( # guides allows us to finely manipulate legends
    color = guide_legend( # we want to adjust the color legend
      override.aes = aes(label = "") # remove the legend guide for the label
    )
  ) +
  scale_color_discrete(name = "Continent") +
  labs(x = NULL,
       y = "Life expectancy (years)") +
  theme(
    legend.position = "inside", # moves legend inside plot bounds
    legend.position.inside = c( 
      0.9,0.3 # values from 0-1 that set legend hor. and ver. position
    ), 
    legend.background = element_rect( # set outline color of legend box
      color = "black"
    )
  )

Our Kruskal-Wallis test tells us that there is a significant difference between groups, but not between which groups. If our data were normal and we performed an ANOVA we would perform a post-hoc test. Instead, we can use a pairwise Wilcoxon rank sum test.

pairwise.wilcox.test(gapminder$lifeExp,
                     gapminder$continent,
                     p.adjust.method = "BH")

    Pairwise comparisons using Wilcoxon rank sum test with continuity correction 

data:  gapminder$lifeExp and gapminder$continent 

         Africa  Americas Asia    Europe
Americas < 2e-16 -        -       -     
Asia     < 2e-16 3.5e-07  -       -     
Europe   < 2e-16 < 2e-16  < 2e-16 -     
Oceania  9.3e-16 5.7e-08  6.4e-10 0.047 

P value adjustment method: BH 

Now we can see that all continents are significantly different from one another!

Preparing plots for presentation

Sometimes we have very busy plots that would benefit from being revealed in sequence as part of storytelling in a presentation. For that, we can use ggreveal, which allows us to reveal parts of a plot sequentially and save each intermediate step for use. Let’s look at the changes in life expectancy over several decades.

gapminder %>%
  filter(year %in% c("1952","1962","1972","1982","1992","2002")) %>%
  filter(
    continent != "Oceania" # removing due to too few countries for density plotting
  ) %>%
  ggplot(aes(x = lifeExp,
             y = continent,
             fill = as.factor(year))) +
  geom_density_ridges(alpha = 0.5,
                      scale = 0.9) +
  labs(x = "Life expectancy (years)",
       y = "Continent") +
  scale_fill_discrete(name = "Year") -> life_exp

Now that we’ve created our life expectancy plot we can choose how to reveal the layers.

reveal_aes(life_exp,
           aes = "fill") -> plot_list

plot_list
[[1]]

[[2]]

[[3]]

[[4]]

[[5]]

[[6]]

[[7]]

You can click through the different plots that ggreveal has made. Now let’s save all these plots individually.

reveal_save(
  plot_list, # list of plots
  "life_exp.tiff"
)

Activities

Green 1

These activities will use the gapminder dataset which is a subset of country-by-country GDP from gapminder.org from 2010 and another of R’s built-in datasets, diamonds, which contains data on diamond size, clarity, and color.

Perform normality tests on the gapminder dataset. Look specifically at the normality of the data from Europe.

Green 2

Create a histogram using the diamonds data set. Facet so that they are aligned vertically.

Blue 1

Create a 2D bin (density) plot showing GDP Per Capita and Life Expectancy using geom_bind2d(). Then, come up with a way to show all five continents separately

Blue 2

Now add density lines to your plot using geom_density2d().

Black 1

Using the gapminder data, use one or more techniques to show the distribution of Life Expectancy over time for each continent.

Hint: Explore the geoms we’ve used to day, or try a new one! You can look up all available geoms here: https://ggplot2.tidyverse.org/reference/index.html

Black 2

Visualize specific comparisons between continents and label with brackets and P values or significance markers.

Hint: use the reference above to work out how to do this

---
title: "Data visualization and statistics"
author: "Yasmin Hilliam, PhD"
date: "2025-07-05"
output: html_notebook
---


This session will continue to explore the uses of `ggplot2` and other packages that can improve your data visualization and statistical analyses. 

Let's load the required packages

```{r}
packages_to_install <- c("tidyverse","gapminder","ggpubr","ggridges","ggreveal")

for (package in packages_to_install) {
  if (!(package %in% rownames(installed.packages()))) {
    install.packages(package, dependencies = TRUE)
    print(paste("Installed package:", package))
  } else {
    print(paste(package, "is already installed"))
  }
}

library(tidyverse)
library(gapminder)
library(ggpubr)
library(ggridges)
library(ggreveal)
```


Now that we understand the building blocks that go into creating a plot with `ggplot` we can make our lives easier by setting a universal theme at the top of our script that will help keep our figures looking consistent.

```{r}
theme_set(theme_bw()) # sets black and white theme for all plots generated in the document

theme_replace( # sets individual plot elements without overwriting theme
  axis.title.x = element_text(size = 14),
  axis.title.y = element_text(size = 14,
                              angle = 90,),
  axis.text.x = element_text(size = 12),
  axis.text.y = element_text(size = 12),
  legend.title = element_text(size = 14),
  legend.text = element_text(size = 12)
)
```


You can still manually set all of these plot elements in the plot code itself, but this means that every plot we make shouldn't require quite so much coding to create a consistent visual style.  

#### Create histograms, density plots, and violin plots

Let's look at the basic scatter plot we created before using the Iris dataset.

```{r}
ggplot(data = iris,
       aes(x = Petal.Length, 
           y = Petal.Width)) +
  geom_point(aes(color = Species))
```


Histograms offer several benefits in data analysis and visualization. They're an important part of exploratory analysis of your data.

- Visualizing Distribution and Symmetry: Histograms provide a visual representation of the distribution of a dataset. They allow you to see the shape, center, and spread of the data. By examining the histogram, you can quickly identify if the data are skewed, symmetric, or bimodal, which can provide insights into the underlying patterns and characteristics of the data. Understanding the skewness of the data can guide decisions regarding appropriate statistical methods and data transformations.

- Identifying Outliers: Histograms can help identify potential outliers or unusual observations in the data. Outliers appear as bars that are significantly separated from the main distribution or as isolated bars with very low frequencies. Detecting outliers is important for understanding data quality and assessing the impact they may have on statistical analyses.

- Data Exploration and Comparison: Histograms allow for the exploration and comparison of multiple variables or subgroups within a dataset. By creating separate histograms for different groups or categories, you can visually compare their distributions and understand the similarities or differences between them.

Let's plot a basic histogram of `Petal.Length`, colored by `Species`

```{r}
ggplot(data = iris,
       aes(x = Petal.Length)) +
  geom_histogram(aes(fill = Species))
```


We'll add black outlines to the bars to make them easier to see

```{r}
ggplot(data = iris,
       aes(x = Petal.Length)) +
  geom_histogram(
    aes(fill = Species),
    color = "black" # adds black outline to all bars
  ) -> iris_histo
```


Split the histogram into facets per species.

```{r}
iris_histo +
  facet_grid(cols = vars(Species)) # with facet_grid we can choose whether to wrap in rows or columns

iris_histo +
  facet_grid(rows = vars(Species)) # this wraps the data in rows
```


#### Checking the normality of data

We can check normality of data using the Shapiro-Wilk Test. The `shapiro.test` function takes a single numeric vector as its argument and returns the result of Shapiro-Wilk test. The test evaluates the null hypothesis that the population from which the data are drawn is normally distributed. A P-value less than the significance level (usually 0.05) suggests that the data are **not** normally distributed. 

```{r}
lapply(iris[,1:4], # use lapply to apply the shapiro.test function to multiple columns
       function(x) shapiro.test(x)) -> normality_tests

# Print the results of normality tests. The results of the normality tests are stored in the normalityTests list, with variable names assigned using names().
names(normality_tests) <- colnames(iris[, 1:4])
print(normality_tests)
```


According to these test results, our only normally distributed data are the Sepal.Width measurements. 
Let's check this by plotting histograms for each variable using `hist()`. The `par(mfrow = c(2,2))` command sets up a 2 x 2 grid layout for the histograms, allowing them to be displayed in separate panels.

```{r}
par(mfrow = c(2, 2))

hist(iris$Sepal.Length, main = "Sepal Length", xlab = "Length")

hist(iris$Sepal.Width, main = "Sepal Width", xlab = "Width")

hist(iris$Petal.Length, main = "Petal Length", xlab = "Length")

hist(iris$Petal.Width, main = "Petal Width", xlab = "Width")
```


A slightly more attractive (but possibly less informative) way to visualize this data is using density ridges from `ggridges`.

To use this, we'll need to adjust the layout of the `iris` dataframe to a long format.

```{r}
iris %>%
  pivot_longer(cols = 1:4, # select measurement columns
               names_to = "measurement",# set column name for variables
               values_to = "value") %>% # set column name for measurements
  ggplot(aes(x = value,
             y = measurement,
             fill = Species)) +
  geom_density_ridges(
    alpha = 0.5, # set opacity of ridges
    scale = 0.98 # set height scaling of ridges
  ) +
  scale_x_continuous(name = "Measurement (cm)") +
  scale_y_discrete(name = NULL,
                   labels = c("Petal.Length" = "Petal Length",
                                "Petal.Width" = "Petal Width",
                                "Sepal.Length" = "Sepal Length",
                                "Sepal.Width" = "Sepal Width"))
```


#### Other density plots

Another way to visualize your data is using density plots, also known as kernel density plots. There are several benefits to visualizing your data through density plots:

- Visualize distribution: Density plots provide a smooth, continuous estimate of the underlying probability density function (PDF) of a variable. Unlike histograms, which use discrete bins, density plots show the shape of the distribution without the binning bias. They provide a more detailed and nuanced representation of the data distribution.

- Non-Parametric Estimation: Density plots do not assume a specific distributional form for the data. They are non-parametric and estimate the density function based on the observed data points. This flexibility makes them suitable for exploring and visualizing data that may not adhere to a specific distribution.

- Comparison and Overlapping Distributions: Density plots make it easy to compare multiple distributions or visualize the overlap of different groups. By using different colors or line types, you can distinguish and compare distributions visually. This is particularly useful for identifying differences, similarities, or shifts in distributions across different groups or variables.

- Communicating Findings: Density plots provide an effective way to communicate distributional characteristics and patterns to others. They offer a visually appealing and intuitive representation of data distributions, making it easier for others to grasp the underlying shape and variability.

```{r}
ggplot(data = gapminder, 
       aes(x = gdpPercap, 
           y = lifeExp)) +
  geom_bin2d(
    bins = c(50,50) # vectors for number of horizontal and vertical bins
  ) +
  scale_fill_viridis_c(
    name = "Number of countries"
  ) + # uses a viridis scale which is easier to visualize
  labs(x = "GDP per capita (USD)", # set x axis title
       y = "Life expectancy (years)") + # set y axis title
  theme(
    legend.position = "top", # set legend position
    legend.key.width = unit(2, "cm"), # set legend key width
    legend.title.position = "top" # move legend title 
  )
```


#### Adding statistics to your plots

`ggplot` has arguments for completing simple statistical tests through the extension `ggpubr`. 

`ggpubr` offers additional functions and utilities for data visualization and statistical analysis, specifically designed to work seamlessly with `ggplot2`. Let's add statistics to some box plots. Here we will compare mean life expectancy between continents using the `gapminder` dataset.

Reference: http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/#google_vignette

First we need to check the normality of our data

```{r}
shapiro.test(gapminder$lifeExp)
```


Our P value is *much* lower than the 0.05 cutoff value which means our data are not normally distributed, so we need to take that into account when choosing a statistical test.

What test would you use to compare means across continents if the data *were* normally distributed? 


```{r}
ggplot(data = gapminder,
       aes(x = continent,
           y = lifeExp,
           color = continent)) +
  geom_boxplot(outliers = FALSE) +
  geom_jitter(width = 0.2,
              alpha = 0.3) +
  stat_compare_means(
    method = "kruskal.test",
    label.y = 90 # sets position of P value label
  ) +
  guides( # guides allows us to finely manipulate legends
    color = guide_legend( # we want to adjust the color legend
      override.aes = aes(label = "") # remove the legend guide for the label
    )
  ) +
  scale_color_discrete(name = "Continent") +
  labs(x = NULL,
       y = "Life expectancy (years)") +
  theme(
    legend.position = "inside", # moves legend inside plot bounds
    legend.position.inside = c( 
      0.9,0.3 # values from 0-1 that set legend hor. and ver. position
    ), 
    legend.background = element_rect( # set outline color of legend box
      color = "black"
    )
  )
```


Our Kruskal-Wallis test tells us that there is a significant difference between groups, but not between which groups. If our data were normal and we performed an ANOVA we would perform a post-hoc test. Instead, we can use a pairwise Wilcoxon rank sum test.

```{r}
pairwise.wilcox.test(gapminder$lifeExp,
                     gapminder$continent,
                     p.adjust.method = "BH")
```


Now we can see that all continents are significantly different from one another!

#### Preparing plots for presentation

Sometimes we have very busy plots that would benefit from being revealed in sequence as part of storytelling in a presentation. For that, we can use `ggreveal`, which allows us to reveal parts of a plot sequentially and save each intermediate step for use. Let's look at the changes in life expectancy over several decades. 

```{r}
gapminder %>%
  filter(year %in% c("1952","1962","1972","1982","1992","2002")) %>%
  filter(
    continent != "Oceania" # removing due to too few countries for density plotting
  ) %>%
  ggplot(aes(x = lifeExp,
             y = continent,
             fill = as.factor(year))) +
  geom_density_ridges(alpha = 0.5,
                      scale = 0.9) +
  labs(x = "Life expectancy (years)",
       y = "Continent") +
  scale_fill_discrete(name = "Year") -> life_exp
```


Now that we've created our life expectancy plot we can choose how to reveal the layers.

```{r}
reveal_aes(life_exp,
           aes = "fill") -> plot_list

plot_list
```


You can click through the different plots that `ggreveal` has made. Now let's save all these plots individually.

```{r}
reveal_save(
  plot_list, # list of plots
  "life_exp.tiff"
)
```


### Activities 

#### Green 1

These activities will use the `gapminder` dataset which is a subset of country-by-country GDP from gapminder.org from 2010 and another of R's built-in datasets, `diamonds`, which contains data on diamond size, clarity, and color.

Perform normality tests on the gapminder dataset. Look specifically at the normality of the data from Europe.

```{r}

```


#### Green 2

Create a histogram using the `diamonds` data set. Facet so that they are aligned vertically.

```{r}

```


#### Blue 1

Create a 2D bin (density) plot showing GDP Per Capita and Life Expectancy using `geom_bind2d()`. Then, come up with a way to show all five continents separately

```{r}

```


#### Blue 2

Now add density lines to your plot using `geom_density2d()`.

```{r}

```


#### Black 1

Using the gapminder data, use one or more techniques to show the distribution of Life Expectancy over time for each continent. 

**Hint: Explore the geoms we've used to day, or try a new one!**
You can look up all available geoms here: https://ggplot2.tidyverse.org/reference/index.html

```{r}

```


#### Black 2

Visualize specific comparisons between continents and label with brackets and P values or significance markers. 

**Hint: use the reference above to work out how to do this**

```{r}

```